Update more parquet APIs to use u64#57
Update more parquet APIs to use u64#57alamb wants to merge 3 commits intokylebarron:kyle/change-parquet-u64from
Conversation
| } | ||
|
|
||
| fn get_bytes(&self, _start: u64, _length: usize) -> Result<Bytes> { | ||
| fn get_bytes(&self, _start: u64, _length: u64) -> Result<Bytes> { |
There was a problem hiding this comment.
This length would only have to change if we expected to be fetching more than 4GB in a single request, no?
| /// Note the returned type is a boxed future, often created by | ||
| /// [FutureExt::boxed]. See the trait documentation for an example | ||
| fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>>; | ||
| fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>>; |
There was a problem hiding this comment.
Likewise, we'll never be fetching more than 4GB in a suffix request for metadata
| metadata: ParquetMetaData, | ||
| /// The offset and bytes of remaining unparsed data | ||
| remainder: Option<(usize, Bytes)>, | ||
| remainder: Option<(u64, Bytes)>, |
There was a problem hiding this comment.
This is an offset into the cached remainder, which is a buffer of the prefetch_size I believe. Assuming that we'll never be prefetching >4GB of data for the metadata, this shouldn't need to change.
There was a problem hiding this comment.
Yeah, you are right -- we don't need to change this.
| pub(crate) fn column_index_range(&self) -> Option<Range<u64>> { | ||
| let offset = u64::try_from(self.column_index_offset?).ok()?; | ||
| let length = u64::try_from(self.column_index_length?).ok()?; |
There was a problem hiding this comment.
I think these might be valid cases that we'd need to change
|
|
||
| impl<T: AsyncFileReader + MetadataFetch + AsyncRead + AsyncSeek + Unpin> MetadataSuffixFetch for T { | ||
| fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>> { | ||
| fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>> { |
There was a problem hiding this comment.
I think we would also have to change the suffix length as well as all other places that use usize as an offset in a file.
There was a problem hiding this comment.
I don't think we necessarily would need to change it in the signature, but we would need to ensure that we cast the suffix to u64 internally rather than casting the length to usize.
There was a problem hiding this comment.
suffix is the length of the returned Bytes, so I'd imagine this should remain a usize.
| // Size of the serialized thrift metadata plus the 8 byte footer. Only set if | ||
| // `self.parse_metadata` is called. | ||
| metadata_size: Option<usize>, | ||
| prefetch_hint: Option<u64>, |
There was a problem hiding this comment.
As these are lengths, they probably should be u64 as well (though I realize it is unlikely people will be "pre" fetching GBs of data
There was a problem hiding this comment.
I think usize is appropriate here...one wouldn't be able to fetch more than 4GB on a 32-bit machine anyway. Also, there are only 4 bytes available in the footer for the metadata length, so metadata_size could probably be change to a u32 rather than u64.
| mut self, | ||
| fetch: F, | ||
| file_size: usize, | ||
| file_size: u64, |
There was a problem hiding this comment.
I think these various load APIs also need to be updated to use a u64 file length to support larger files in parquet
|
|
||
| fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> { | ||
| let mut buffer = Vec::with_capacity(length); | ||
| fn get_bytes(&self, start: u64, length: u64) -> Result<Bytes> { |
There was a problem hiding this comment.
I think by @tustvold's comment in apache#7371 this one in theory doesn't need to be updated (as the length bytes needs to fit into memory / Bytes anyways)
| /// This is parsed from the last 8 bytes of the Parquet file | ||
| pub struct FooterTail { | ||
| metadata_length: usize, | ||
| metadata_length: u64, |
| /// Error when the requested column index is more than the | ||
| /// number of columns in the row group | ||
| IndexOutOfBound(usize, usize), | ||
| IndexOutOfBound(u64, u64), |
There was a problem hiding this comment.
This is an index into an array or vector, so this should remain usize IMO
| /// Returned when a function needs more data to complete properly. The `usize` field indicates | ||
| /// Returned when a function needs more data to complete properly. The `u64` field indicates | ||
| /// the total number of bytes required, not the number of additional bytes. | ||
| NeedMoreData(usize), | ||
| NeedMoreData(u64), |
There was a problem hiding this comment.
This is thrown when a provided buffer needs to have more data in it. Here too it doesn't make sense to ask for more bytes than a usize can index.
Note: -Targets apache#7371
This gives some idea of what other APIs would need to be changed to fully and properly support > 4GB files in WASM (32 bit
usize)I may be mistaken, but I wanted to get this PR up for discussion